De Novo Genome Assembly ◾ 91
The overlap-consensus algorithm performs pairwise alignment and then it represents the
reads and the overlaps between the reads with graphs, where contiguous reads are the
vertexes (nodes) and their overlaps (common prefix or suffix) are the edges. Each node (ri)
corresponds to a read and any two reads are connected by an edge e r r
(
)
,
1
2 where the suffix
of the first read matches the prefix of the second read. The algorithm then attempts to find
the Hamiltonian path of the graph that includes the vertexes (reads) exactly once. Contigs
are then created from the consensus sequences of the overlapped suffixes and prefixes. The
downside of this method is that it requires a large sequencing coverage and the complexity
of the Hamiltonian path increases with the increase of reads. An example for an assembler
using this approach is Celera Assembler [4]. To show you how the approach of the overlap-
consensus graphs works, assume that we have the reads:
CAGAACCTAGC
ATAGCAGAACG
GCTAAGCAGTG
AGTGTCACGAC
TAGCAACTATG
AACGTAGCAGA
Figure 3.3 and 3.4 show the overlap graphs in which the reads are the nodes and the over-
lapped prefixes and suffixes are the edges. The Hamiltonian path includes each node
exactly once to form the consensus sequence.
3.1.3 De Bruijn Graphs
De Bruijn graph [5, 6] approach for de novo assembly was first introduced in 2001 for assem-
bling short reads produced by the NGS technologies and then it was refined in 2008 [7]. In
de Bruijn graphs, each read is broken into overlapping substrings of length k called over-
lapping k-mers. The k-mers then are represented by graphs in which each k-mer is a vertex
(node). Any two nodes of any two k-mers that share a common prefix and suffix of length
FIGURE 3.2 Greedy assembly.